### DEGREE OF FLEXIBILITY... REGRESSION VS
SMOOTH SPLINES ###
> setwd("C:/Users/baron/627 Statistical Machine Learning/data")
> load("Auto.rda")
> attach(Auto)
# Fit REGRESSION model predicting miles per gallon based on weight.
# Plot the regression line in red color with thickness=3
> reg = lm(mpg~weight)
> plot(weight,mpg)
> abline(reg,col="red",lwd=3)
# In general, abline(a,b) plots the line y = a + bx
# Fit a SPLINE with 2 degrees of freedom (straight line) to our data
and plot it.
> spline2 = smooth.spline(weight,mpg,df=2)
> lines(spline2,col="blue")
# Add one more degree of freedom -> quadratic spline
> spline3 = smooth.spline(weight,mpg,df=3)
> lines(spline3,col="green",lwd=4)
# Increase flexibility by adding more d.f.
> spline20 = smooth.spline(weight,mpg,df=20)
> lines(spline20,col="brown",lwd=4)
# Blow flexibility to 100 degrees of freedom.
# The resulting spline is heavily dependent
on each data point.
# Its prediction power is very low.
> spline100 = smooth.spline(weight,mpg,df=100)
> lines(spline100,col="orange",lwd=4)
# While spline100 is very flexible, and it
matches the data most closely, it would not be powerful for prediction. We’ll learn how to measure prediction accuracy with various
cross-validation tools.
How to choose the optimal method?
Cross-validation technique.
> n = length(mpg); Z = sample(n,n/2); attach(Auto[Z,]); # This will be our training
data
> ss5 = smooth.spline(weight,
mpg, df=5) # Fit the spline using training data only
> attach(Auto);
> Yhat
= predict(ss5,
x=weight)
> names(Yhat) # Notice: prediction consists of two parts
– predictor x and
[1] "x" "y" # predicted response y. We can call them Yhat$x and Yhat$y.
> mean((
Yhat$y[-Z] - mpg[-Z] )^2) # Then, compute prediction mean-squared
error on test data
[1] 17.76551 # This is the
cross-validation error for a spline with df=5
# Try many different splines and choose the one
with the smallest prediction error.
> cv.err
= rep(0,50);
> for
(p in 1:50){
+ attach(Auto[Z,]); ss = smooth.spline(weight, mpg, df=1+p/10) #
Try DF = 1.1, 1.2, …, 6.0
+ attach(Auto); Yhat = predict(ss, weight) # DF must be > 1
+ cv.err[p]
= mean( (Yhat$y[-Z] -
mpg[-Z])^2 )
}
> df = 1+(1:50)/10
> plot(df,cv.err)
> which.min(cv.err)
[1] 20
> df[20]
[1] 3
# This cross-validation
method chooses the spline with 3 d.f.